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Abstract. In 2007, Eickmeyer et al. showed that the tree topologies 
outputted by the Neighbor- Joining (NJ) algorithm and the balanced 
minimum evolution (BME) method for phylogenetic reconstruction are 
each determined by a polyhedral subdivision of the space of dissimilarity 
maps R^s)^ where n is the number of taxa. In this paper, we will analyze 
the behavior of the Neighbor- Joining algorithm on five and six taxa and 
study the geometry and combinatorics of the polyhedral subdivision of 
the space of dissimilarity maps for six taxa as well as hyperplane repre- 
sentations of each polyhedral subdivision. We also study simulations for 
one of the questions stated by Eickmeyer et al., that is, the robustness of 
the NJ algorithm to small perturbations of tree metrics, with tree models 
which are known to be hard to be reconstructed via the NJ algorithm. 



1 Introduction 

The Neighbor-Joining (NJ) algorithm was introduced by Saitou and Nei [13] 
and is widely used to reconstruct a phylogenetic tree from an alignment of DNA 
sequences because of its accuracy and computational speed. The NJ algorithm 
is a distance-based method which takes all pairwise distances computed from 
the data as its input, and outputs a tree topology which realizes these pairwise 
distances, if there is such a topology (see Fig.[T]). Note that the NJ algorithm is 
consistent, i.e., it returns the additive tree if the input distance matrix is a tree 
metric. If the input distance matrix is not a tree metric, then the NJ algorithm 
returns a tree topology which induces a tree metric that is "close" to the input. 
Since it is one of the most popular methods for reconstructing a tree among 
biologists, it is important to study how the NJ algorithm works. 

A number of attempts have been made to understand the good results ob- 
tained with the NJ algorithm, especially given the problems with the inference 
procedures used for estimating pairwise distances. For example, Bryant showed 
that the Q- criterion (defined in ( [Q] ) in Section [2.2p is in fact the unique selection 
criterion which is linear, permutation invariant, and consistent, i.e., it correctly 
finds the tree corresponding to a tree metric [5]. Gascuel and Steel gave a nice 
review of how the NJ algorithm works [T5] . 

One of the most important questions in studying the behavior of the NJ al- 
gorithm is to analyze its performance with pairwise distances that are not tree 
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Fig. 1. The NJ algorithm takes a matrix of pairwise distances (left) as input and 
computes a binary tree (right). If there is a tree such that the distance matrix 
can be obtained by taking the length of the unique path between two nodes. NJ 
outputs that tree. 

metrics, especially when all pairwise distances are estimated via the maximum 
likelihood estimation (MLE). In 1999, Atteson showed that if the distance es- 
timates are at most half the minimal edge length of the tree away from their 
true value then the NJ algorithm will reconstruct the correct tree [1] . However, 
Levy et al. noted that Atteson's criterion frequently fails to be satisfied even 
though the NJ algorithm returns the true tree topology [10| . Recent work of [11] 
extended Atteson's work. Mihaescu et al. showed that the NJ algorithm returns 
the true tree topology when it works locally for the quartets in the tree [11] . This 
result gives another criterion for when then NJ algorithm returns the correct tree 
topology and Atteson's theorem is a special case of Mihaescu et al.'s theorem. 

For every input distance matrix, the NJ algorithm returns a certain tree 
topology. It may happen that the minimum Q-criterion is taken by more than 
one pair of taxa at some step. In practice, the NJ algorithm will then have to 
choose one cherry in order to return a definite result, but for our analysis we 
assume that in those cases the NJ algorithm will return a set containing all tree 
topologies resulting from picking optimal cherries. There are only finitely many 
tree topologies, and for every topology t we get a subset Dt of the sample space 
(input space) such that for all distance matrices in Dt one possible answer of the 
NJ algorithm is t. We aim at describing these sets Dt and the relation between 
them. One notices that the Q-criteria are all linear in the pairwise distances. 
The N J algorithm will pick cherries in a particular order and output a particular 
tree t if and only if the pairwise distances satisfy a system of linear inequalities, 
whose solution set forms a polyhedral cone in R^^), Eickmeyer et al. called such 
a cone a Neighbor- Joining cone, or NJ cone. Thus the NJ algorithm will output 
a particular tree t if and only if the distance data lies in a union of N J cones [3] . 

In [3] , Eickmeyer et al. studied the optimality of the N J algorithm compared 
to the balanced minimum evolution (BME) method and focused on polyhedral 
subdivisions of the space of dissimilarity maps for the BME method and the NJ 
algorithm. Eickmeyer et al. also studied the geometry and combinatorics of the 
NJ cones for n = 5 in addition to the BME cones for n < 8. Using geometry 
of the NJ cones for n = 5, they showed that the polyhedral subdivision of the 
space of dissimilarity maps with the NJ algorithm does not form a fan for n > 5 
and that the union of the NJ cones for a particular tree topology is not convex. 
This means that the NJ algorithm is not convex, i.e., there are distance matrices 
D, D', such that NJ produces the same tree ti on both inputs D and £>', but it 
produces a different tree t2 ti on the input {D + D')/2 (see [3] for an example). 



In this paper, we focus on describing geometry and combinatorics of the NJ 
cones for six taxa as well as some simulation study using the NJ cones for one 
of the questions in [3], that is, what is the robustness of the NJ algorithm to 
small perturbations of tree metrics for n = 5. This paper is organized as follows: 
In Section [2] we will describe the NJ algorithm and define the NJ cones. Section 
[3] states the hypcrplane representations of the NJ cones for n = 5. Section [5] 
describes in summary the geometry and combinatorics of the NJ cones for n — 6. 
Section [5] shows some simulation studies on the robustness of the N J algorithm 
to small perturbations of tree metrics for n = 5 with the tree models from |13| . 
We end by discussing some open problems in Section [6l 



2 The Neighbor-Joining algorithm 
2.1 Input data 

The NJ algorithm is a distance-based method which takes a distance matrix, a 
symmetric matrix ((ij;j)o<i.j<?i-i with da = representing pairwise distances of 
a set of n taxa {0, 1, . . . ,n — 1}, as the input. Through this paper, we do not 
assume anything on an input data except it is symmetric and da = 0. Because of 
symmetry, the input can be seen as a vector of dimension m :— Q) — in(?7 — 1). 
We arrange the entries row- wise. Wc denote row/column-indices by pairs of 
letters such as a, &, c, d, while denoting single indices into the "flattened" vector 
by letters The two indexing methods are used simultaneously in the 

hope that no confusion will arise. Thus, in the four taxa example wc have do.i = 
di,o = do. In general, we get di — da,b = db^a with 
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and for c > d wc get 
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2.2 The Q-Criterion 

The NJ algorithm starts by computing the so called Q-criterion or the cherry 
picking criterion, given by the formula 

n—l n—1 

qa,b ■= (n -2)da,b -^da,k -^dk,b- (Q) 

fc=0 k=0 

This is a key of the NJ algorithm to choose which pair of taxa is a neighbor. 

Theorem 1 (Saitou and Nei, 1987, Studier and Keppler, 1988 [T41IT6] ). 

Let da^b for all pair of taxa {a, b} be the tree metric corresponding to the tree 
T . Then the pair {x, y} which minimizes qafi for all pair of taxa {a, 6} forms a 
neighbor. 



Arranging the Q-criteria for all pairs in a matrix yields again a symmetric 
matrix, and ignoring the diagonal entries we can see it as a vector of dimension 
m just like the input data. Moreover, the Q-criterion is obtained from the input 
data by a linear transformation: 

q = A(")d, 

and the entries of the matrix are given by 

n — 4 if i = j, 

A^f-4:L-<|-1 if*7^J-and{a,6}n{c,d}^0, (1) 
^0 else, 

where a > 6 is the row/column- index equivalent to i and likewise for c > d and 
j. When no confusion arises about the number of taxa, we abbreviate A^") to A. 

After computing the Q-criterion q, the NJ algorithm proceeds by finding the 
minimum entry of q, or, equivalently, the maximum entry of — q. The two nodes 
forming the chosen pair (there may be several pairs with minimal Q-criterion) 
are then joined ("cherry picking"), i.e., they are removed from the set of nodes 
and a new node is created. Suppose out of our n taxa {0, . . . , n — 1}, the first 
cherry to be picked is m — 1, so the taxa n — 2 and n — 1 are joined to form a 
new node, which we view as the new node number n — 2. The reduced pairwise 
distance matrix is one row and one column shorter than the original one, and by 
our choice of which cherry we picked, only the entries in the rightmost column 
and bottom row differ from the original ones. Explicitly, 

for 0<i< ("2 ') 




d.+(„-2) - drn-i) for ("-2) < I < ("-!) 

and we see that the reduced distance matrix depends linearly on the original 
one: 

d' = i?d, 

with R = (nj) e M(™-"+i)x™^ where 

1 for < i = 3 < ("2 2) 

1/2 ^or{^^-')<^<{^^-'),J=^ 
1/2 for {"^') <z J =z + n-2 

-1/2 for ("2^) <»< ("2')'^ =™-! 
else 

The process of picking cherries is repeated until there are only three taxa left, 
which are then joined to a single new node. 

We note that since new distances d' are always linear combinations of the 
previous distances, all Q-criteria computed throughout the NJ algorithm are 
linear combinations of the original pairwise distances. Thus, for every possible 



tree topology t outputted by the NJ algorithm (and every possible ordering a of 
picked cherries that results in topology t) , there is a polyhedral cone 
of dissimilarity maps. The NJ algorithm will output t and pick cherries in the 
order a iff the input lies in the cone Ct.o-- We call the cones CT,a Neighbor- 
Joining cones, or NJ cones. 

2.3 The shifting lemma 

We first note that there is an n-dimensional linear subspace of K"* which does 
not affect the outcome of the NJ algorithm (see [H]). For a node a we define its 
shift vector Sa by 

fl ifae{6,c} 
^^")^-^=|o else 

which represents a tree where the leaf a has distance 1 from all other leaves 
and all other distances are zero. The Q-criterion of any such vector is —2 for 
all pairs, so adding any linear combination of shift vectors to an input vector 
does not change the relative values of the Q-criteria. Also, regardless of which 
pair of nodes we join, the reduced distance matrix of a shift vector is again a 
shift vector (of lower dimension), whose Q-criterion will also be constant. Thus, 
for any input vector d, the behavior of the NJ algorithm on d will be the same 
as on d + s if s is any linear combination of shift vectors. We call the subspace 
generated by shift vectors S. 

We note that the difference of any two shift vectors is in the kernel of A, and 
the sum of all shift vectors is the constant vector with all entries equal to n. If 
we fix a node a then the set 

{sa - Sfa 1 6 ^ a} 

is linearly independent. 

2.4 The first step in cherry picking 

After computing the Q-critcrion q, the NJ algorithm proceeds by finding the 
minimum entry of it, or, equivalently, the maximum entry of — q. The set cqi C 
K'" of all q-vectors for which qi is minimal is given by the normal cone at the 
vertex — to the (negative) simplex 

An-i = conv{-e, I < i < TO - 1} C M™, 

where eo, . . . , Cm-i are the unit vectors in K™. The normal cone is defined in 
the usual way by 

J^A^.A~e.,) :={x e R™ I (-e„x) > (p,x) for p E Am-i} 

= {x e R" I (-e„ x) > (-Cj, x) for < j < m - 1} , ^ ' 

with (•, •) denoting the inner product in M™. 



Substituting q — AA into ^ gives 



q S eg,; <^ i = argmax(— e^, Ad) 

<^ i = argmax(— A"^ej, d) (3) 
<=i- i = argmax(— Acj, d) because A is symmetric. 

Therefore the set cdi of all 'parameter vectors d for which the NJ algorithm will 
select cherry i in the first step is the normal cone at —Aei to the polytope 

P„ := conv{-Aeo, . . . , -Ae„,_i}. (4) 

The shifting lemma implies that the affine dimension of the polytope P„ is at 
most m — n. Computations using polymake show that this upper bound gives 
the true value. 

If equality holds for one of the inner products in this formula, then there are 
two cherries with the same Q-criterion. 

As the number of taxa increases, the resulting polytope gets more compli- 
cated very quickly. By symmetry, the number of facets adjacent to a vertex 
is the same for every vertex, but this number grows following a strange pat- 
tern. See Table 1 for some calculated values. We also computed f- vectors for 
P„ via polymake. With n = 5, we have (1,10,45,90,75,22,1), with n = 6, 
(1, 15, 105, 435, 1095, 1657, 1470, 735, 195, 25, 1), and with n > 7 we ran polymake 
over several hours and it took more than 9GB RAM. Therefore, we could not 
compute them. 
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Table 1. The polytopes P„ for some small numbers of taxa n. 



2.5 The cone cdi 

Equation ([3]) allows us to write cdi as an intersection of half-spaces as follows: 

cd, = {x I {-Aei,x) > (-Aej,x) for j ^ i} 
= {x I (-A(e, - e,), x) > for j ^ ^} 

= f|{x|(-A(e,-e,),x)>0} 



We name the half-spaces, their interiors and the normal vectors defining them 
as follows: 

hif := -A(")(e,-e,), 

4^):={xeM™|(h!;\x)>0}, (6) 
i?^) := {xeM™|(h!;\x)>0}, 

where again we omit the superscript (n) if the number of taxa is clear. 

If there are more than four taxa, then this representation is not redundant: 
For any pair i and j of cherries, we can find a parameter vector d lying on the 
border of Hij (i.e., (h^j, d) = 0) but in the interior Hik of the other half-spaces 
for k 7^ One such d is given by 




2 if fc = I or fc = j, 
4 else. 



Thus we have found an ^-representation of the polyhedron cdi consisting of 
only m — 1 inequalities. Note that Table 1 implies that a V-representation of the 
same cone would be much more complicated, as the number of vectors spanning 
it is equal to the number of facets incident at the vertex —Aei. 

Example 1. The normal vectors to the 22 facets of P5, and thus the rays of the 
normal cones to P5, form two classes (see Fig. [2]). The first class contains a total 
of 12 vectors (as there are 12 assignments of nodes to 4 to the labels a to 
e which yield nonisomorphic labelings), and every normal cone contains six of 
them. The second class contains 10 vectors, and again every normal cone has six 
of them as rays. For each class there are two diagrams in Fig. [21 and we obtain 




3/ 



Fig. 2. Diagrams describing the facet-normals of P5. 

a normal vector to one of the facets of P5 by assigning nodes from {0, . . . , 4} 
to the labels a, . . . , e. The left diagram tells which vertices of P5 belong to the 
facet defined by that normal vector: Two nodes in the diagram arc connected by 
an edge iff the vertex belonging to that pair of nodes is in the facet. The edges 
in the right diagram are labeled with the distance between the corresponding 
pair of nodes in the normal vector. This calls for an example: Setting a = 0, . . . , 
e = 4, Fig. [2Ua) gives a distance vector 



{dm,do2, c?24, dsi)'^ = (-1, 1, 1, -1, -1, 1, 1, -1, 1, -1)'^, 



which is a common ray of the cones cdoi, cdi2, cd23, cd^^ and cc?o4- Thus of the 
22 facets of P5, 12 have five vertices and 10 have six vertices. 



3 The NJ cones for five taxa 

In the case of five taxa there is just one unlabeled tree topology (cf. Fig. [3]) and 
there are 15 distinct labeled trees: We have five choices for the leaf which is not 
part of a cherry and then three choices how to group the remaining four leaves 
into two pairs. For each of these labeled topologies, there are two ways in which 
they might be reconstructed by the NJ algorithm: There are two pairs, any one 
of which might be chosen in the first step of the NJ algorithm. 




Fig. 3. (a) A tree with five taxa (b) The same tree with all edges adjacent to 
leaves reduced to length zero. The remaining two edges have lengths a and /3. 

For distinct leaf labels a, b and c G {0, 1,2,3,4} we define Cab,c to be the 
set of all input vectors for which the cherry a-b is picked in the first step and c 
remains as single node not part of a cherry after the second step. For example, 
the tree in Fig.[31Ja) is the result for all vectors in Cio,2UC43_2- Since for each tree 
topology ((a, 6), c, (d, e)) (this tree topology is written in the Newick format) for 
distinct taxa a, b, c,d,e G {0, 1,2,3, 4}, the NJ algorithm returns the same tree 
topology with any vector in the union of two cones Cab,c U Cde,c, there are 30 
such cones in total, and we call the set of these cones C. 

3.1 Permuting leaf labels 

Because there is only one unlabeled tree topology, wc can map any labeled 
topology to any other labeled topology by only changing the labels of the leafs. 
Such a change of labels also permutes the entries of the distance matrix. In this 
way, we get an action of the symmetric group S5 on the input space R^°, and 
the permutation a € S5 maps the cone Cab.c linearly to the cone Ca-(a)cr(h),cr(c)- 
Therefore any property of the cone Cab,c which is preserved by unitary linear 
transformations must be the same for all cones in C, and it suffices to determine 
it for just one cone. 

The action of S5 on R^^ decomposes into irreducible representations by 



'4 m ID'S 



where the first summand is the subspace of all constant vectors and the second 
one IS the kernel of A'^^\ The sum of these two subspaces is exactly the space 
iS* generated by the shift vectors. The third summand, which we call W , is the 
orthogonal complement of S and it is spanned by vectors Wab,cd in W with 







if xy 
if xy 
else 



ab or xy — cd 
ac or xy = bd 



where a, b, c and d arc pairwise distinct taxa in {0, 1,2,3,4} and {wab,cd)xy is 
the x-yih coordinate of the vector Wab,cd- One linearly independent subset of 
this is 



wi := W01,34, 



W2 



Wl2,40, W3 := «;23,01, 



W34,12, W.5 



Note that the 5-cycle (01234) of leaf labels cyclically permutes these basis vec- 
tors, whereas the transposition (01) acts via the matrix 



T := - 
2 



/2 1 1 1 1\ 

1-1-1 -1 
0-1-1 1-1 
0-1 1-1-1 
yO -1-1-1 1/ 



Because a five-cycle and a transposition generate S'5, in principle this gives us 
complete information about the operation. 



3.2 The cone 043,2 

Since we can apply a permutation cr S S'5, without loss of generality, we suppose 
that the first cherry to be picked is cherry 9, which is the cherry with leaves 3 
and 4. This is true for all input vectors d which satisfy 

(h9,„d) >0fori = 0,...,8, 

where the vector 

hi;) :=-A(")(e,-e,) 

is perpendicular to the hyperplanc of input vector for which cherries i and j 
have the same Q-criterion, pointing into the direction of vectors for which the 
Q-criterion of cherry i is lower. 

We let ri, T2 and ra be the first three rows of - i?(5) . If (n , d) is maximal 
then the second cherry to be picked is 0-1, leaving 2 as the non-cherry node, and 
similarly Y2 and lead to non-cherry nodes 1 and 0. This allows us to define the 
set of all input vectors d for which the first picked cherry is 3-4 and the second 
one is 0-1: 



C34.2 := {d I (hg d) > for i = 0, . . . , 8, and (n - r2, d) > 0, (ri - rg, d) > 0}. 

(7) 



We have defined this set by 11 bounding hyperplanes. However, in fact, the 
resulting cone has only nine facets. A computation using polymake [5] reveals 
that the two hyperplanes hg,i and hg^2 are no longer faces of the cone, while the 
other nine hyperplanes in ([7]) give exactly the facets of the cone. That means 
that, while we can find arbitrarily close input vectors d and d' such that with 
an input d the NJ algorithm will first pick cherry 3-4 and with an input d' the 
NJ algorithm will first pick cherry 1-2 (or 0-2), we cannot do this in such a way 
that d will result in the labeled tree topology of Fig. [31 where 2 is the lonely 
leaf. 

Also note that C, the set of NJ cones which the NJ algorithm returns the 
same tree topology with any vector in the union of two cones Cah.c U Cde.c: is 
not convex which is shown in [5] . For details of geometry and combinatorics of 
the NJ cones for n = 5, see [3]. 

4 The six taxa case 

Note that since each of the NJ cones includes constraints for five taxa, the 
union of the NJ cones which gives the same tree topology is not convex. To 
analyze the behavior of NJ on distance matrices for six taxa, we use the action 
of the symmetric group as much as possible. However, in this case we get three 
different classes of cones which cannot be mapped onto each other by this action. 
We assume the cherry which is picked in the first step to consist of the nodes 4 
and 5. Picking this cherry replaces these two nodes by a newly created node 45, 
and we have to distinguish two different cases in the second step (see Fig. |4]): 

— If the cherry in the second step does not contain the new node 45, we may 
assume the cherry to be 01. For the third step, we again get two possibilities: 

• The two nodes 45 and 01 get joined in the third step. We call the cone 
of input vectors for which this happens Ci. 

• The node 45 is joined to one of the nodes 2 and 3, without loss of 
generality, to 3. We call the resulting cone Cn. 

— If the cherry in the second step contains the new node 45, we may assume 
the other node of this cherry to be 3, creating a new node 45 — 3. In the 
third step, all that matters is which of the three nodes 0, 1 and 2 is joined 
to the node 45 — 3, and we may, without loss of generality, assume this to 
be node 2. This gives the third type of cone. Cm. 

The resulting tree topology for the cone Ci is shown in Fig.[5Ka), while both 
Cii and Cm give the topology shown in[5fb). We now determine which elements 
of 5*6 leave these cones fixed (stabilizer) and how many copies of each cone give 
the same labeled tree topology: 
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180 
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solid angle (approx.) 


2.888-10-3 


1.848 ■ 10-3 


2.266-10-3 




(a) (b) 

Fig. 5. The two possible topologies for trees with six leaves, with edges connect- 
ing to leaves shrunk to zero. 

Thus, the input space R^^ is divided into 450 cones, 90 of type I and 180 each 
of types II and III. There are 15 different ways of assigning labels to the tree 
topology in Fig.OJa), and for each of these there are six copies of Ci whose union 
describes the set of input vectors resulting in that topology. For the topology in 
Fig. [DJb) we get 90 ways of assigning labels to the leaves, each corresponding to 
a union of two copies of Cn and two copies of Cm. 

The above table also gives the solid angles of the three cones. In the five 
taxa case, any two cones can be mapped onto one another by the action of the 
symmetric group, which is unitary. Therefore all thirty cones have the same 
solid angle, which must be 1/30. However, in the six taxa case, we get different 
solid angles, and we see that about three thirds of the solid angle at the origin 
are taken by the cones of types // and ///. Thus, on a random vector chosen 
according to any probability law which is symmetric around the origin (e.g., 
standard normal distribution), NJ will output the tree topology of Fig. O^b) 
with probability about 3/4. 

On the other hand, any labeled topology of the type in Fig. [5l^a) belongs 
to six cones of type /, giving a total solid angle of « 1.73 • 10~^, whereas any 
labeled topology of the type in Fig.O^b) belongs to two cones each of type // and 
///, giving a total solid angle of only « 0.82 • 10~^, which is half as much. This 
suggests that reconstructing trees of the latter topology is less robust against 
noisy input data. 



5 Simulation results 



In this section we will analyze how the tree metric for a tree and pairwise dis- 
tances estimated via the maximum likelihood estimation lie in the polyhedral 
subdivision of the sample space. In particular, we analyze subtrees of the two 
parameter family of trees described by [13] • These are trees for which the NJ 
algorithm has difficulty in resolving the correct topology. In order to understand 
which cones the data lies in, we simulated 10,000 data sets on each of the two 
tree shapes, Ti and T2 (Fig.[B]) at the edge length ratio, a/b = 0.03/0.42 for se- 
quences of length 500BP under the Jukes-Cantor model [9] . We also repeated the 
runs with the Kimura 2-parameter model . They are the cases (on eight taxa) 
in [13] that the NJ algorithm had most difficulties in their simulation study (also 
the same as in [IH])- Each set of 5 sequences are generated via evolver from 
PAML package [T^j under the given model, evolver generates a set of sequences 
given the model and tree topology using the birth-and-death process. For each 
set of 5 sequences, we compute first pairwise distances via the heuristic MLE 
method using a software f astDNAml [T^]. To compute cones, we used MAPLE and 
polymake. 

T T 

h a 



b b 



b a 



b I b 

b b 



Fig. 6. Ti and T2 tree models which are subtrees of the tree models in |13j . 

To study how far each set of pairwise distances estimated via the maximum 
likelihood estimation (which is a vector y in M^) lies from the cone, where the ad- 
ditive tree metric lies, in the sample space, we calculated the ^2-distance between 
the cone and a vector y. 

Suppose we have a cone C defined by hyperplanes rii, . . . , n^, i.e., 

C = {x I (rij, x) > for i = 1, . . . , r}, 

and we want to find the closest point in C from some given point v. Because C 
is convex, for ^2-norm there is only one such point, which we call u. If v S C 
then u = V and we are done. If not, there is at least one with (ni,v) < 0, 
and u must satisfy (ni,u) = 0. 

Now the problem reduces to a lower dimensional problem of the same kind: 
We project v orthogonally into the hypcrplane H defined by (ni,x) = and call 
the new vector v. Also, C Ci H is a facet of C, and in particular a cone, so we 
proceed by finding the closest point in this cone from v. 
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Fig. 7. Distances of correctly (top) and incorrectly (bottom) classified input 
vectors to the closest incorrectly/correctly classified vector. 



We say an input vector (distance matrix) is correctly classified if the vector is 
in one of the cones where the vector representation of the tree metric (noiseless 
input) is. We say an input vector is incorrectly classified if the vector is in the 
complement of the cones where the vector representation of the tree metric is. 
For input vectors (distance matrices) which are correctly classified by the NJ 
algorithm, we compute the minimum distance to any cone giving a different tree 
topology. This distance gives a measure of robustness or confidence in the result, 
with bigger distances meaning greater reliability. The results are plotted in the 
left half of Fig. [7] and in Fig. [S] Note that the distance of the noiseless input, 
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Fig. 8. Mean and variance of the distances of correctly classified vectors from 
the nearest misclassified vector. 





JC 


Kimura2 




Tl T2 


Tl T2 


# of cases 


6,419 3,559 


6,205 5,533 


Mean 


0.0594 0.0331 


0.0951 0.0761 


Variance 


0.0203 7.39 ■ lO""* 


0.0411 3.481 ■ 10"^ 



Fig. 9. Mean and variance of the distances of misclassified vectors to the nearest 
correctly classified vector. 

i.e., the tree metric from the tree we used for generating the data samples, gives 
an indication of what order of magnitude to expect with these values. 

For input vectors to which the NJ algorithm returns with a tree topology 
different from the correct tree topology, we compute the distances to the two 
cones for which the correct answer is given and take the minimum of the two. 
The bigger this distance is, the further we are off. The results are shown in the 
right half of Fig. [7] and in Fig. M 

From our results in Fig. [8] and Fig. [9l one notices that the NJ algorithm 
returns the correct tree more often with T2 than with Ti. These results are 
consistent with the results in |15|llj . Note that any possible quartet in Ti has 
a smaller (or equal) length of its internal edge than in T2 (see Fig. Gascuel 
and Steel defined this measure as neighhorliness [15j . Mihaescu et al. showed 
that the NJ algorithm returns the correct tree if it works correctly locally for 
the quartets in the tree [11] . The neighhorliness of a quartet is one of the most 
important factors to reconstruct the quartet correctly, i.e., the shorter it is the 
more difficult the NJ algorithm returns the correct quartet. Also Fig. [7] shows 
that most of the input vectors lie around the boundary of cones, including the 
noiseless input vector (the tree metric) . This shows that the tree models Ti and 
T2 are difficult for the NJ algorithms to reconstruct the correct trees. All source 
code for these simulations described in this paper will be available at authors' 
websites. 

6 Open problems 

Question 1. Can we use the NJ cones for analyzing how the NJ algorithm works 
if each pairwisc distance is assumed to be of the form Do + £ where Do is the un- 
known true tree metric, and e is a collection of independent normally distributed 
random variables? We think this would be very interesting and relevant. 



Question 2. With any n, is there an efficient method for computing (or approxi- 
mating) the distance between a given pairwise distance vector and the boundary 
of the NJ optimahty region which contains it? This problem is equivalent to pro- 
jecting a point inside a polytopal complex P onto the boundary of P. Note that 
the size of the complex grows very fast with n. How fast does the number of 
the complex grow? This would allow assigning a confidence score to the tree 
topology computed by the NJ algorithm. 
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